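The CUDA execution model turns your computer into a high-performance heterogeneous system. Picture a grand conductor (the host, or CPU) and an army of thousands (the device, or GPU): the conductor handles complex logic and decision-making, while the army carries out massive, repetitive tasks simultaneously.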
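1. Structural Differences
The host is a latency-optimized CPU, well suited to complex control flow and sequential tasks. The device, by contrast, is a throughput-optimized GPU that contains many simple cores and is designed to execute the same instruction across huge datasets simultaneously.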
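2. The Rhythm of Execution
A CUDA program runs as a series of phases. Execution begins on the host with sequential code. When the program reaches a parallel kernel, the host launches a grid of threads on the device. Once the device finishes that workload, control returns to the host.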
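3. Performance Specialization
This model exploits the strengths of both processors: the CPU manages system resources and complex branching, while the GPU processes data elements in parallel using SPMD (single program, multiple data) logic.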
QUESTION 1
Which architecture is characterized as being 'throughput-optimized'?
The Host (Intel® CPU)
The Device (NVIDIA® GPU)
The System RAM
The PCIe Bus
✅ Correct!
Correct! GPUs are designed to maximize the total amount of work (throughput) done per unit of time by processing thousands of data points simultaneously.
❌ Incorrect
The Host (CPU) is 'latency-optimized' to minimize the time a single thread takes to execute.
QUESTION 2
The reader should complete Part 1 of the MatrixMultiplication() example in Figure 3.6 with similar declarations of an Nd and a Pd pointer variable as well as their corresponding cudaMalloc() calls. Furthermore, Part 3 in Figure 3.6 can be completed with the required deallocation calls. Which of the following follows this pattern correctly?
float *Nd, *Pd; cudaMalloc((void**)&Nd, size); ... cudaFree(Nd);
float Nd, Pd; malloc(&Nd, size); ... free(Nd);
float *Nd, *Pd; cudaMemcpy(Nd, Pd, size); ... delete Nd;
int Nd, Pd; Nd = new float[size]; ... free(Nd);
✅ Correct!
Exactly. You must declare pointers for the device buffers, pass cudaMalloc the address of each pointer cast to (void**), and use cudaFree to release the memory.
❌ Incorrect
Standard C malloc/free or C++ new/delete cannot be used to manage Device (GPU) memory.
QUESTION 3
In the CUDA execution model, where does a program always begin its execution?
On the Device (GPU)
Simultaneously on both
On the Host (CPU)
In the Global Memory
✅ Correct!
Correct. Execution starts with the serial code on the Host (CPU).
❌ Incorrect
The GPU only begins work when a Kernel is specifically launched by the Host.
QUESTION 4
What happens when the Host encounters a phase with rich data parallelism?
It speeds up its clock frequency.
It launches a Kernel onto the Device.
It stores the data in the Host Cache.
It converts the code to Python.
✅ Correct!
Yes! The Host 'offloads' the parallel work by launching a kernel on the massive core array of the GPU.
❌ Incorrect
The CPU is not optimized for massive data parallelism; it offloads such work to the Device.
QUESTION 5
A student attempts to launch a 1024x1024 matrix multiplication on G80 hardware using 1024 blocks, where each thread calculates one element. Why will this fail?
The G80 cannot handle 1024 blocks.
The total number of threads exceeds 1 million.
The configuration results in 1024 threads per block, exceeding the 512 hardware limit.
Matrix multiplication is not data parallel.
✅ Correct!
Precisely. 1,048,576 elements divided by 1024 blocks results in 1024 threads per block, which exceeds the G80 architecture limit of 512.
❌ Incorrect
Check the thread-per-block limit for the G80 architecture: it is 512.
Case Study: High-Resolution Fluid Dynamics
Optimizing a Heterogeneous Simulation
You are developing a fluid dynamics engine. The simulation involves: (A) Calculating the user interface and file logging, (B) Computing the pressure gradients for 20 million fluid cells, and (C) Updating the simulation time-step based on global convergence tests. You must decide how to map these tasks to the CUDA execution model.
Q1. Which tasks (A, B, or C) should remain on the Host, and why?
Solution:
Task A (UI and Logging) and Task C (Time-step logic) should remain on the Host. These tasks are serial in nature, involve complex I/O and control logic, and do not benefit from throughput optimization. The Host is designed to minimize the latency of these single-threaded tasks.
Q2. How does the 'alternating phases' concept apply to the interaction between tasks B and C?
Solution:
The program enters a loop where the Host launches Task B (the parallel pressure kernel) on the Device. Once Task B completes (synchronization), control returns to the Host to perform Task C (serial convergence check). This repeats for every time-step in the simulation.